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ABSTRACT 

Joint analysis of data from multiple sources has the potential 
to improve our understanding of the underlying structures 
in complex data sets. For instance, in restaurant recom- 
mendation systems, recommendations can be based on rat- 
ing histories of customers. In addition to rating histories, 
customers' social networks (e.g., Facebook friendships) and 
restaurant categories information (e.g.. Thai or Italian) can 
also be used to make better recommendations. The task of 
fusing data, however, is challenging since data sets can be 
incomplete and heterogeneous, i.e., data consist of both ma- 
trices, e.g., the person by person social network matrix or 
the restaurant by category matrix, and higher-order tensors, 
e.g., the "ratings" tensor of the form restaurant by meal by 
person. 

In this paper, we are particularly interested in fusing data 
sets with the goal of capturing their underlying latent struc- 
tures. We formulate this problem as a coupled matrix and 
tensor factorization (CMTF) problem where heterogeneous 
data sets are modeled by fitting outer-product models to 
higher-order tensors and matrices in a coupled manner. Un- 
like traditional approaches solving this problem using al- 
ternating algorithms, we propose an all-at-once optimiza- 
tion approach called CMTF-OPT (CMTF-OPTimization), 
which is a gradient-based optimization approach for joint 
analysis of matrices and higher-order tensors. We also ex- 
tend the algorithm to handle coupled incomplete data sets. 
Using numerical experiments, we demonstrate that the pro- 
posed all-at-once approach is more accurate than the alter- 
nating least squares approach. 
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1. INTRODUCTION 

With the ability to access massive amounts of data as a re- 
sult of recent technological advances, e.g., the Internet, com- 
munication and mult i- media devices, genomic technologies 
and new medical diagnostic techniques, we are faced with 
data sets from multiple sources. For instance, in restaurant 
recommendation systems, online review sites like Yelp have 
access to shopping histories of customers, friendship net- 
works of those customers, as well as categorizations of the 
restaurants. Similarly, for medical diagnoses, several types 
of data are collected from a patient; for example, EEC (elec- 
troencephalogram) and ECG (electrocardiogram) monitor- 
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Figure 1: Coupled Data Sets of Different Orders. Each ten- 
sor entry indicates the rating of a customer for a specific 
meal (i.e., breakfast, lunch, dinner) at a particular restau- 
rant. Matrices Y and Z show restaurant categories (i.e., 
Thai, Chinese, Italian) and social network information, re- 
spectively. Thus, the third-order tensor X of type restaurant 
by meal by customer can be coupled with the restaurant by 
category matrix Y and the customer by customer matrix Z. 



ing data, fMRI (functional Magnetic Resonance Imaging) 
scans, and other data gathered from laboratory tests. 

Analysis of data from multiple sources requires handling of 
data sets of different orders [7, 25, 30], i.e., matrices and/or 
higher-order tensors. For instance, Banerjee et al. [7], dis- 
cusses the problem of analyzing heterogeneous data sets with 
a goal of simultaneously clustering different classes of enti- 
ties based on multiple relations, where each relation is rep- 
resented as a matrix (e.g., movies by review words matrix 
showing movie reviews) or a higher-order tensor (e.g., movies 
by viewers by actors tensor showing viewers' ratings). Sim- 
ilarly, coupled analysis of matrices and tensors has been a 
topic of interest in the areas of community detection [22], 
collaborative filtering [36], and chemometrics [30]. 

As an example of data sets from multiple sources, without 
loss of generality, suppose we have a third-order tensor, X G 



' , and a matrix, Y G 



, coupled in the first 



dimension (mode) of each. The common latent structure 
in these data sets can be extracted through coupled matrix 
and tensor factorization (CMTF), where an i?-component 
CMTF model of a tensor X and a matrix Y is defined as: 



/(A, B, C, V) = II X - [A, B, C] ||^+ Y - AV^ 



(1) 



where matrices A G R^''^,B G R-^""^ and C G R^""^ 
are the factor matrices of X extracted using a CANDE- 
COMP/PARAFAC (CP) model [10, 13, 15]. The CP model 
is one of the most commonly used tensor models in the lit- 
erature (for a list of CP applications, see reviews [3, 19]). 



Here, we use the notation X = JA, B, C] to denote the CP 
model. Similarly, matrices A and V G R^^^ are the factor 
matrices extracted from matrix Y through matrix factor- 
ization. The formulation in (1) easily extends to multiple 
matrices and tensors, e.g., as shown in Figure 1. We also 
note that we focus on the least squares error in this paper, 
but our algorithms can be extended to other loss functions 
such as Bregman information metric used in, e.g., [7]. We 
briefly illustrate two motivating applications of coupled ma- 
trix and tensor factorizations. 

Example 1: Clustering. Joint analysis of data from mul- 
tiple sources may capture fine-grained clusters that would 
not be captured by the individual analysis of each data set. 
Suppose that there is a set of customers and there are two 
sources of information about these customers, X G ]^^^'^^^ 
and Y G R^^^, one storing information about which items 
customers have bought over a period of time, and the other 
showing where customers live. Within this set of customers, 
there are 4 groups: {Gi , 6^2 , , ^4 } and each group consists 
of people who live in the same neighborhood and have an in- 
terest in similar items. Imagine that the matrix Y can only 
discriminate between (G1UG3) and (G2^G4). A rank-2 ma- 
trix SVD factorization of Y would yield factors that could 
be used to cluster the customers as in the top plot shown 
in Figure 2 (SVD), failing to fully separate the four groups. 
Conversely, imagine that the tensor X only has enough in- 
formation to discriminate between (Gi UG2) and {G3 UG4). 
In this case, a rank-2 CP factorization of the tensor would 
still only separate the data into two groups, albeit two dif- 
ferent groups, as illustrated in the middle plot in Figure 2 
(CP). If, however, we jointly factor the matrix and tensor 
simultaneously using CMTF, the four groups are completely 
separated, as shown in the bottom plot of Figure 2 (CMTF). 
Details of the data generation for this example are provided 
in the appendix. □ 
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Figure 2: Clustering based on matrix SVD factorization of 
Y vs. CP tensor factorization of X vs. coupled matrix-tensor 
factorization of X and Y. The subplots present the scatter 
plots showing the first factor plotted against the second fac- 
tor in the first mode. 



Example 2: Missing Data Recovery. CMTF can be 
used for missing data recovery when data from different 
sources have the same underlying low-rank structure (at 
least in one mode) but some of the data sets have miss- 
ing entries (Figure 3). If a matrix or a higher-order tensor 
has a low-rank structure, it is possible to recover the missing 
entries using a limited number of data entries [1, 9]. How- 
ever, if there is a large amount of missing data, then the 
analysis of a single data set is no longer enough for accu- 
rate data recovery. Here, we provide an example illustrating 
that missing entries can still be recovered accurately using 
CMTF even when the analysis of a single data set fails to 
do so. 
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Figure 3: Incomplete tensor X and matrix Y coupled in the 
first mode. 



Suppose we have a tensor X and a matrix Y computed as 
X = |[A,B,CJ and Y = AV"^, where matrices A G R^""^, 
B G R-^""^, C G R^""^ and V G R^""^ are generated using 
random entries drawn from the standard normal distribu- 
tion. X G R-^x-^x-^ is constructed by randomly setting M% 
of the entries of X to be missing, i.e., X = W * X, where the 
binary tensor W is the same size as tensor X and Wijk = 
if we want to set Xijk to missing. In order to recover the 
missing entries, one approach is to fit an i?-component CP 
model to X and use the extracted factors to recover missing 
entries. An alternative approach is to fit an i?-component 
CMTF model to X and Y by extracting a common factor 
matrix in the first mode and then make use of CMTF factors 
to recover the missing entries. 

Figure 4 illustrates how the recovery error behaves for 
different amounts of missing data. We observe that CP is 
accurate in terms of recovering missing entries if less than 
80% of the entries are missing. However, there is a sharp in 
error as we further increase the amount of missing entries. 
On the other hand, CMTF can compute factors with low 
recovery error for higher amounts of missing data; only for 
problems with more that 90% missing data does the recovery 
error increase^. □ 

In order to solve coupled matrix and tensor factorization, 
various algorithms have been proposed in the literature us- 
ing a variety of loss functions [7, 22]. These algorithms all 
propose an alternating scheme, where the basis for each en- 
tity type is determined one at a time. In this paper, we 
focus solely on the L2 loss function, which is often solved 
using alternating least squares (ALS). Especially when fit- 
ting tensor models, ALS is the most commonly used algo- 
rithm due to its speed and ease of implementation. On the 
other hand, ALS suffers from several problems: (i) It may 
fail to find the underlying components accurately if the num- 
ber of components is not correctly estimated [2, 32]; (ii) In 



""^Note that the amount of missing data where the recovery 
error makes a sharp increase may change depending on the 
values of /, J, K, V and R. For example, with small data 
sizes and large dealing with missing data is challenging [1]. 
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Figure 4: Missing data recovery using CP factorization of 
X vs. CMTF of X and Y. Tensor Completion Score 
is the recovery error measure used here and is defined as 

'^ll ' where X corresponds to the data recon- 

structed using the extracted factor matrices, i.e., A,B, and 
C, asX= [[A,B,C]. 



the presence of missing data, ALS-based imputation tech- 
niques may suffer from poor convergence [8] and do not scale 
to large-scale data sets [1]. Therefore, unlike the previous 
work using alternating least squares algorithms, we propose 
an all-at-once optimization approach solving for all variables 
simultaneously. Our contributions in this paper are 

• Developing an algorithm called CMTF-OPT (CMTF- 
OPTimization) based on first-order optimization to 
solve coupled matrix and tensor factorization problem 
for a given number of components (i?).^ 

• Extending CMTF-OPT algorithm to handle incom- 
plete data sets, i.e., data with missing or unknown 
entries. 

• Demonstrating that CMTF-OPT is more accurate than 
an alternating least squares approach using numerical 
experiments. 

This paper is organized as follows. In §2, we introduce the 
notation for tensors and tensor operations. After discussing 
the related work on coupled data analysis in §3, we introduce 
our CMTF-OPT algorithm and its extension to data with 
missing entries in §4. Section 5 describes numerical exper- 
iments and demonstrates how CMTF-OPT compares with 
CMTF-ALS in terms of capturing the underlying factors in 
coupled data sets. Finally, we conclude with future research 
directions in §6. 

2. NOTATION AND BACKGROUND 

Tensors of order > 3 are denoted by Euler script letters 
(X,y,2i), matrices are denoted by boldface capital letters 
(A, B, C), vectors are denoted by boldface lowercase letters 
(a, b,c), and scalars are denoted by lowercase letters (a, 
5, c). Columns of a matrix are denoted by boldface lower 
letters with a subscript, e.g., is the rth column of matrix 
A. Entries of a matrix or a tensor are denoted by lowercase 
letters with subscripts, i.e., the (ii, ^2, • • • , '^a^) entry of an 



^Determining the number of components R in CMTF re- 
mains as a challenge just like computing the tensor rank. 



N-way tensor X is denoted by Xi-^^i2---ij^ . 

Given a matrix A of size I x J, vec(A) stacks the columns 
of the matrix and forms a vector of length IJ: 



vec(A) 



Given two matrices A G 



ai 



aj 



and B G 



their 



Khatri-Rao product is denoted by A B and defined as 
columnwise Kronecker product. The result is a matrix of 
size (/J) X K and defined by 



A B = [ai bi a2 ( 



where denotes Kronecker product. For more details on 
properties of Kronecker and Khatri-Rao products, see [19]. 

An N-way tensor can be rearranged as a matrix; this is 
called matricization. The mode-n matricization of a ten- 
sor X G M^^ ''^2x--xi^, -g (denoted by X(^) and arranges the 
mode-n one-dimensional "fibers" to be the columns of the 
resulting matrix. 

Given two tensors X and y of equal size h x I2 x ■ ■ ■ x In , 
their Hadamard (elementwise) product is denoted by X * y 
and defined as 



^*l*2---*7V^*l*2---*7V 



for all in G {!,..., In} and n G {!,..., N}. Their inner 
product, denoted by (X,y ), is the sum of the products of 
their entries, i.e., 

(x,y) = y ^ y ^ ••• y ^ xi^i2---i^yiii2---iN • 

ii — l 22 = 1 *Ar = l 

For a tensor X of size /i x /2 x • • • x /at, its norm is || X || = 
^( X, X ). For matrices and vectors, || • || refers to the anal- 
ogous Frobenius and two-norm, respectively. 

Given a sequence of matrices A^^-* of size In x R for n = 
1, . . . , AT, the notation [A^^^ A^^^ . . . , A^^^] defines an h x 
I2 X • • • X In tensor whose elements are given by 



(n) 



which is NP-hard [14]. 



(iA«,A(^),...,A(-)i) =En4 



for in G {1, . . . , /n}, n G {1, . . . , N}. For just two matrices, 
this reduces to [A, B] = AB"^. 

3. RELATED WORK IN DATA FUSION 

Data fusion, also called collective data analysis, multi- 
block, multi-view or multi-set data analysis, has been a topic 
of interest in different fields for decades. First, we briefly dis- 
cuss data fusion techniques for multiple data sets each rep- 
resented as a matrix and then focus on techniques proposed 
for coupled analysis of heterogenous data sets. 

3.1 Collective Factorization of Matrices 

The analysis of data from multiple sources attracted con- 
siderable attention in the data mining community during 
the Netflix Prize competition [20], where the goal was to 
make accurate predictions about movie ratings. In order 
to achieve better rating predictions, additional data sources 
complementing user ratings such as tagging information have 
been exploited; e.g., users tag movies [33] as well as features 



while not "converged" do 

rescale all factors to unit Frobenious norm 
solve for A (for fixed B, C, V) 

miHA II [X(i) Y] - A [(C B)"^ V"^] 
solve for B (for fixed A, C, V) 

minB ||X(2) -B(C0 A)"^ ||^ 
solve for C (for fixed A, B, V) 

minc||X(3)-C(B0A)"^||' 
solve for V (for fixed A, B, C) 

minv II Y- AV"^ ||^ 
end while 



of movies such as movie types or movie players . Singh 
and Gordon [28] proposed Collective Matrix Factorization 
(CMF) to take advantage of correlations between different 
data sets and simultaneously factorize coupled matrices. Gi- 
ven two matrices X and Y of size I x M and I x L, respec- 
tively, CMF can be formulated as 

/(U, V, W) = II X - UV"^ IP + II Y - UW"^ IP , (2) 

where U, V and W are factor matrices of size I x M x R 
and L X R, respectively and R is the number of factors. 
This formulation is a special case of the general approach 
introduced in [28] , which extends to different loss functions. 
Earlier, Long et al. [24, 23] had also studied collective matrix 
factorization using a different matrix factorization scheme 
than X = UV^. The proposed approaches in those studies 
are based on alternating algorithms solving the collective 
factorization problem for one factor matrix at a time. 

Analysis of multiple matrices dates back to one of the 
earliest models aiming to capture the common variation in 
two data sets, i.e.. Canonical Correlation Analysis (CCA) 
[16]. Later, other studies followed CCA by extending it to 
more than two data sets [18], focusing on simultaneous fac- 
torization of Gramian matrices [21] and working on PGA of 
multiple matrices [17, 34]. Moreover, several approaches for 
simultaneous factor analysis have been developed for spe- 
cific applications as well; e.g., population differentiation in 
biology [31], blind source separation [37], multimicrophone 
speech filtering [11], and microarray data analysis [4, 5, 6]. 

Tensor factorizations [3, 19, 29] can also be considered as 
one way of analyzing multiple matrices. For instance, when 
a tensor model is fit to a third-order tensor, multiple coupled 
matrices are analyzed simultaneously. Nevertheless, neither 
CMF nor tensor factorizations can handle coupled analysis 
of heterogeneous data, which we address next. 

3.2 Collective Factorization of Mixed Data 

As described in §1, heterogeneous data consists of data 
sets of different orders, i.e., both matrices and higher-order 
tensors. The formulation in (2) can be extended to het- 
erogeneous data sets: Given a tensor X and a matrix Y 
of sizes I X J X K and / x M, respectively, we can formu- 
late their factorization coupled in the first mode as shown 
in (1). Note that (1) can be considered as a special case of 
the approach introduced by Smilde et al. [30] for multi-way 
multi-block data analysis, where different tensor models can 
be fit to higher-order data sets and the factor matrices cor- 
responding to the coupled modes do not necessarily match. 
The same formulation as in (1) has recently been studied in 
psychometrics as Linked-Mode PARAFAC-PCA [35]. As a 
more general framework, not restricted to squared Euclidean 
distance, Banerjee et al. [7] introduced a mult i- way cluster- 
ing approach for relational and multi-relational data where 
coupled analysis of multiple data sets including higher-order 
data sets were studied using minimum Bregman informa- 
tion. The paper [22] also discussed coupled analysis of mul- 
tiple tensors and matrices using nonnegative factorization 
by formulating the problem using KL-divergence. All these 
studies propose algorithms that are based on alternating ap- 
proaches. 

In this paper we focus on the squared Euclidean distance 
as the loss function as in (1). Rewriting X = [A,B,C| as 
X(i) = A(C0B)"^, X(2) = B(C0 A)"^, etc., for the different 



Figure 5: CMTF-ALS: Alternating Least Squares Algorithm 
for coupled matrix and tensor factorization of a third-order 
tensor X and matrix Y coupled in the first mode. 



modes of X, we summarize the steps of an alternating least 
squares algorithm for solving (1) in Figure 5. The main 
loop in the algorithm is often terminated as a function of 
the relative change in function value (e.g., as defined below 
in (5)), a function of the relative change in factor matrices, 
and/or after a prescribed number of iterations. 

ALS-based algorithms are simple to implement and com- 
putationally efficient; however, ALS has shown to be error- 
prone when fitting a CP model in the case of overfactoring, 
i.e., the number of extracted components is more than the 
true number of underlying components [2, 32]. Furthermore, 
in the presence of missing data, ALS-based techniques may 
have poor convergence [8] and do not scale to very large 
data sets [1]. On the other hand, all-at-once optimization, 
in other words, solving for all CP factor matrices simultane- 
ously has shown to be more robust to overfactoring [2, 32] 
and easily extends to handle data with missing entries even 
for very large data sets [1]. Therefore, in order to deal with 
these issues, we develop an algorithm called CMTF-OPT, 
which formulates coupled analysis of heterogeneous data sets 
just like [30, 35] but solves for all factor matrices simultane- 
ously using a gradient-based optimization approach. 

4. CMTF-OPT ALGORITHM 

In this section, we consider joint analysis of a matrix and 
an A/'th-order tensor with one mode in common, where the 
tensor is factorized using an i?-component CP model and 
the matrix is factorized by extracting R factors using ma- 
trix factorization. Let X G M^^''^^ and Y G M^^''^ 
have the nth mode in common, where n G {1, . . . , A/"}. The 
objective function for coupled analysis of these two data sets 
is defined by 

/(a«,a(^),...,aw,v) 

= 1 II X - IA(^ . . . , aW] f + ^ II Y - A(")V^ f (3) 

Our goal is to find the matrices A^*^ G M^^^^ for i = 
1, 2, ...A/" and matrix V G R^^'^ that minimize the objective 
in (3). In order to solve this optimization problem, we can 
compute the gradient and then use any first-order optimiza- 
tion algorithm [27]. Next, we discuss the computation of the 
gradient for (3). 



We can rewrite (3) as two components, /i and /2: 



X- 



.(1) 



||2 1 



/l(A(l),A(2),...,A(^)) 



/2(A(-),V) 



The partial derivative of /i with respect to A^*^ has been 
derived in [2] so we just present the results here. 
Let Z= [A(^\...,AW|, then 



where 



(Z(i) -X(i))A 



(-0 



A^'+^) A' 



(i-i) 



0- 



■0A' 



(1) 



for i = l,...,Ar. 

The partial derivatives of the second component, /2, with 
respect to A^*^ and V can be computed as 



df2 



-YV + A^-^) V"^V, for i = n, 
for i ^ n, 



|0 = -y^a(^)+va(^)V^). 

Combining the above results, we can compute the partial 
derivative of / with respect to factor matrix A^*-*, for i = 
1,2,..., AT, and V as: 



df 



dfi 



+ 



df2 



dV dV 

Finally, the gradient of /, which is a vector of size P = 
^(I2n=i + formed by vectorizing the partial 

derivatives with respect to each factor matrix and concate- 
nating them all, i.e.. 



V/ 



-(ala)) 



vec 
vec 



4.1 CMTF-OPT for Incomplete Data 

In the presence of missing data, it is still possible to do 
coupled analysis by ignoring the missing entries and fitting 
the tensor and/or the matrix model to the known data en- 
tries. Here we study the case where tensor X has missing 
entries as in Figure 3. Let W G M^^''^^ indicate the 
missing entries of X such that 



1 if Xi-^i 

if Xi-^^i 



is known, 
is missing. 



for all in G {!,..., In} and n G {!,..., N}. We can then 
modify the objective function (3) as 

/w(aW,a(^\...,a(^),v) 

= 1 II W * (X - [A(i>, . . . , A^l) f +^ II Y - a(") f 

/Wi(A(i),A(2),...,A(^)) 



The first term, /wi , corresponds to the weighted least 
squares problem for fitting a CP model while the second 
term stays the same as in (3). The partial derivative of /wi 
with respect to A^*^ can be computed as in [1] as 



dfv 



W(i) * X(i)) a(- 



for z = 1, . . . , A/". Since the partial derivatives of the second 
term do not change, we can write the partials of / with 
respect to A*^^^ for i = 1, 2, A^, and V as: 



Ofw 

dA^'^ 

dfw 

dv 



Mi 

aA(0 

dv 



df2 



forzG{l,...,iV}/M, 
for i = n. 



The gradient of /w, V/w, is also a vector of size P = 
R(Yln=i + ^) formed in the same way as 

V/. 

Once we have the function, / (or /w), and gradient, V/ 
(or V/w), values, we can use any gradient-based optimiza- 
tion algorithm to compute the factor matrices. For the re- 
sults presented in this paper, we use the Nonlinear Con- 
jugate Gradient (NCG) with Hestenes-Steifel updates [27] 
and the More-Thuente line search [26] as implemented in 
the Poblano Toolbox [12]. 



5. EXPERIMENTS 

We compare the performance of the proposed CMTF- 
OPT algorithm with the ALS-based approach (i.e., CMTF- 
ALS as shown in Figure 5) in terms of accuracy using ran- 
domly generated matrices and tensors. Our goal is to see 
whether the algorithms can capture the underlying factors 
in the data (i) when R = R factors are extracted, and (ii) 
in the case of overfactoring, e.g., when R = R -\- 1 factors 
are extracted, where R and R denote the extracted and true 
number of factors. 

5.1 Data Generation 

In our experiments, three different scenarios are used (see 
Table 1). In the first case, we have a tensor X and a matrix 

Y coupled in the first mode. In the second case, we have 
two third-order tensors X and y coupled in one dimension. 
The third case has a third-order tensor X and two matrices 

Y and Z such that the tensor shares one mode with each 
one of the matrices. 

To construct the tensors and matrices in each scenario, 
random factor matrices with entries following standard nor- 
mal distribution are generated and a tensor (or a matrix) 
is formed based on a CP (or a matrix) model. For ex- 
ample, for the first scenario, we generate random factor 
matrices A G R^""^, B G R-^""^ and C G R^""^, and 
form a third-order tensor X G R^x^x^ based on a CP 
model. Gaussian noise is later added to the tensor, i.e., 
X = [[A,B,CI + ryN^^^^^, where N G R^x-^x^ cor- 
responds to the random noise tensor and 77 is used to ad- 
just the noise level. Similarly, we generate a factor matrix 

Y G R^""^ and form matrix Y = AV"^ + r/N^"" " 



f-. The 

matrix N G R^^^ is a matrix with entries drawn from the 
standard normal distribution and is used to introduce dif- 
fering amounts of noise in the data. In our experiments, we 



use three different noise levels, i.e., 77 = 0.1,0.25 and 0.35, 
and the true number of factors is = 3. 

5.2 Performance Metric 

We fit an ^-component CMTF model using CMTF-OPT 
and CMTF-ALS algorithms, where R = R and R = R -\- 
1, and compare the algorithms in terms of accuracy; in 
other words, how well the factors extracted by the algo- 
rithms match with the original factors used to generate the 
data. For instance, for the first scenario described above, let 
A, B, C and V be the original factor matrices and A, B, C 
and V be the extracted factor matrices. We quantify how 
well the extracted factors match with the original ones using 
the factor match score (FMS) defined as follows: 

FMS = min(l ^^"^ ^"^J )|aja^bjb^cjc^vjv^|. (4) 

^ max(^r,^^) 

Here, the columns of factor matrices are normalized to unit 
norm and denotes the weight for each factor r, for r = 
1, 2, ...R. The weight for each factor is computed as follows: 
We can rewrite X = JA, B, C] as X = X^^^i Ara^ o o c^, 
where o denotes n-mode vector outer product; the columns 
of the factor matrices are normalized to unit norm and Xr is 
the product of the vector norms in each mode. Similarly, we 
can rewrite Y = AV^ as Y = o^rSir o Vr by normaliz- 

ing the matrix columns and using ar to denote the product 
of vector norms. We define the norm of each CMTF com- 
ponent, i.e., ^r, as = Ar + ttr. lu (4) , and indicate 
the weights corresponding to the original and extracted rth 
CMTF component, respectively. If the extracted and origi- 
nal factors perfectly match, FMS value will be 1. It is also 
considered a success if FMS is above a certain threshold, i.e., 
(0.99)^, where N is the number of factor matrices. 

5.3 Stopping Conditions 

As a stopping condition, both algorithms use the relative 
change in function value and stop when 

I /current ~ /previous] <^]^Q — 8 
/previous 

where / is as given in (3). Additionally, for CMTF-ALS, the 
maximum number of iterations is set to 10^. For CMTF- 
OPT, the maximum number of function values (which cor- 
responds the number of iterations in an ALS algorithm) is 
set to 10^ and the maximum number of iterations is set to 
10^ CMTF-OPT also uses the two-norm of the gradient 
divided by the number of entries in the gradient and the 
tolerance for that is set to 10~^. In the experiments, the al- 
gorithms have generally stopped due to the relative change 
in function value criterion. However, we also rarely observe 
that both CMTF-OPT and CMTF-ALS have stopped since 
the algorithms have reached the maximum number of itera- 
tions, i.e., approximately 2% of all runs for CMTF-OPT and 
0.3% of all runs for CMTF-ALS. Furthermore, for around 
2% of all runs, CMTF-OPT has stopped by satisfying the 
condition on the gradient. 

5.4 Results 

We demonstrate the performance of the algorithms in 
terms of accuracy in Table 1 and Table 2. Table 1 presents 
the results for the experiments where factor matrices are 
generated with columns normalized to unit norm. For in- 
stance, for the example discussed above, this corresponds to 



generating tensor X and matrix Y using Xr — ar — 1, for 
r = 1,2, ..R. In Table 1, for all different scenarios, we ob- 
serve that when the correct number of underlying factors are 
extracted, i.e., R = R, the success ratios of both algorithms 
are compatible. Since we repeat our experiments with 30 dif- 
ferent sets of factor matrices for each set of parameters, we 
report the average factor match score for 30 runs. The aver- 
age scores for both algorithms are quite close and p-values 
computed for paired-sample t-tests indicate that differences 
in the scores are not statistically significant. On the other 
hand, for all scenarios, when we look at the cases where the 
data is overfactored, i.e., R = R -\- 1^ CMTF-OPT signif- 
icantly outperforms CMTF-ALS. While CMTF-OPT algo- 
rithm is quite accurate in terms of recovering the underlying 
factors in the case of over factoring, the accuracy of CMTF- 
ALS is quite low. We also observe that the increase in the 
noise level barely affects the accuracies of the algorithms. 

In Table 2, we present the results for a harder set of ex- 
periments, where the norm of each tensor and matrix com- 
ponent, i.e., Xr,ar for r — 1,2, ..i?, is a randomly assigned 
integer greater than or equal to 1.^ Similar to the previous 
case, we observe that CMTF-OPT is more robust to over- 
factoring compared to CMTF-ALS. However, the accuracies 
reported in this table are lower compared to Table 1. In 
particular, as the noise level increases, it becomes harder to 
find the underlying components and accuracies drop even if 
the true number of underlying factors are extracted from the 
data. For instance, for the first scenario, when the noise level 
is 0.35, the accuracy of CMTF-OPT is 60% for the case we 
extract the correct number of components. This is in part 
due to not being able to find ^r, for r = 1, 2, ..R, accurately. 
If we slightly change the FMS such that we only require 
miur la^eirb^brC^CrV^Vr^l to be greater than the threshold, 
then the average accuracy goes up to around 73%. The rest 
of the failing runs is due to the fact that extracted factors 
are distorted compared to the original factors used to gen- 
erate the data. Note that factor match scores are still high 
which indicates that the factors are only slightly distorted. 

6. CONCLUSIONS 

We have seen a shift in data mining in recent years: from 
models focusing on matrices to those studying higher-order 
tensors and now we are in need of models to explore and ex- 
tract the underlying structures in data from multiple sources. 
One approach is to formulate this problem as a coupled ma- 
trix and tensor factorization problem. In this paper, we 
address the problem of solving coupled matrix and tensor 
factorizations when we have squared Euclidean distance as 
the loss function and introduce a first-order optimization al- 
gorithm called CMTF-OPT, which solves for all factor ma- 
trices in all data sets simultaneously. We have also extended 
our algorithm to data with missing entries by introducing 
CMTF-WOPT for the case where we have an incomplete 
higher-order tensor coupled with matrices. The algorithm 
can easily be extended to multiple incomplete data sets. 

To the best of our knowledge, the algorithms proposed 
so far for fitting a coupled matrix and tensor factorization 
model have all been alternating algorithms. We have com- 
pared our CMTF-OPT algorithm with a traditional alter- 

^Each norm is chosen as the absolute value of a number ran- 
domly chosen from A/'(0, 25) rounded to the nearest integer 
plus 1. 



Table 1: Comparison of CMTF-OPT and CMTF-ALS for the cases where Xr = ar = ... = 1, for r = 1,2, ...R. 



Noise 



R 



ALG. 



Success Mean p-val 
(%) FMS 



Success 
(%) 



Mean 
FMS 



p-val 



Success 
(%) 



Mean 
FMS 



p-val 



0.10 



R 



R+1 



OPT 
ALS 
OPT 
ALS 



100.0 
100.0 



1.00 
1.00 



9.3e-l 



96.7 
96.7 



0.97 
0.97 



1.0 



100.0 
96.7 



1.00 
0.96 



96.7 
3.3 



0.97 
0.29 



3.3e-13 



100.0 
6.7 



1.00 
0.71 



1.6e-8 



96.7 
0.0 



0.97 
0.06 



3.3e-l 



3.8e-24 



T] = 0.25 



R 



OPT 
ALS 



100.0 
100.0 



0.99 
0.99 



4.5e-l 



100.0 
100.0 



1.00 
1.00 



0.2e-l 



96.7 
100.0 



0.96 
0.99 



R+l 



OPT 
ALS 



100.0 
6.7 



0.99 
0.34 



2.6e-ll 



100.0 
16.7 



1.00 
0.70 



100.0 
10.0 



0.99 
0.14 



3.3e-l 



7.0e-16 



0.35 



R 



OPT 
ALS 



100.0 
100.0 



0.99 
0.99 



5.8e-l 



100.0 
96.7 



1.00 
0.97 



3.3e-l 



100.0 
100.0 



0.99 
0.99 



i?+l 



OPT 
ALS 



90.0 
33.3 



0.92 
0.55 



1.8e-5 



100.0 
13.3 



1.00 
0.71 



l.Oe-7 



86.7 
36.7 



0.88 
0.39 



3.6e-l 



2.9e-5 



Table 2: Comparison of CMTF-OPT and CMTF-ALS for the cases where A^^, ar, ... > 1 for r = 1, 2, ...R. 



50 50 



W luu 



Noise 



R 



ALG. 



Success) Mean p-val 
(%) FMS 



Success 
(%) 



Mean 
FMS 



p-val 



Success 
(%) 



Mean 
FMS 



p-val 



T) = 0.10 



R 



OPT 
ALS 



96.7 
100.0 



0.96 
0.99 



3.3e-l 



100.0 
100.0 



1.00 
1.00 



9.1e-l 



96.7 
90.0 



0.96 
0.90 



R+1 



OPT 
ALS 



90.0 
3.3 



0.96 
0.24 



1.5e-13 



100.0 
13.3 



1.00 
0.52 



i.8e-9 



83.3 
10.0 



0.89 
0.13 



3.3e-l 



1.6e-ll 



0.25 



R 



OPT 
ALS 



76.7 
73.3 



0.97 
0.94 



3.3e-l 



96.7 
96.7 



0.97 
0.97 



1.0 



83.3 
70.0 



0.97 
0.84 



R+1 



OPT 
ALS 



86.7 
13.3 



0.97 
0.40 



6.4e-9 



100.0 
46.7 



1.00 
0.72 



7.8e-5 



76.7 
10.0 



0.90 
0.23 



0.4e-l 



1.4e-8 



T) = 0.35 



R 



OPT 
ALS 



60.0 
56.7 



0.95 
0.95 



3.6e-l 



100.0 
96.7 



1.00 
0.97 



3.3e-l 



53.3 
53.3 



0.92 
0.92 



i?+ 1 



OPT 
ALS 



46.7 
6.7 



0.87 
0.35 



1.7e-9 



100.0 
40.0 



1.00 
0.62 



4.0e-6 



50.0 
10.0 



0.83 
0.30 



7.4e-l 



4.9e-7 



nating least squares approach and the numerical results show 
that all-at-once optimization is more robust to overfactoring 
as it has also been the case for fitting tensor models [2, 32]. 

Note that our current formulation of coupled matrix and 
tensor factorization has one drawback, which is encountered 
when we have a data set whose factor matrices are shared 
by all other data sets. For instance, we may have a tensor 
X, where X = |[A,B,C| = X^f^i ^rSir o br o Cr and two 
matrices Y — AV^ = J^f^^ arSir o Vr and Z — BV""" = 
^^=iPrhr o Vr. If wc formulatc a coupled factorization 
for these data sets as /(A, B, C, V) = || X - [A, B, C] f + 
||Y-AV"^|| +||Z-BV"^||^ we do not take into account 
the cases for Ar / 7^ /^r- For such scenarios, scaling am- 
biguities should be taken into consideration by introducing 
a set of parameters for the scalars in the formulation. 



Another issue we have not addressed in this paper is how 
to weigh different parts of the objective function modeling 
different data sets as discussed in [35]. This is an area of 
future research where a Bayesian framework for coupled ma- 
trix and tensor factorizations may be a promising approach. 
As another area of future research, we plan to extend our ap- 
proach to different loss functions in order to be able to deal 
with different types of noise and incorporate nonnegativity 
constraints on the factors as nonnegativity often improves 
the interpretability of the model. 

7. ACKNOWLEDGMENTS 

This work was supported in part by the Laboratory Di- 
rected Research and Development program at Sandia Na- 
tional Laboratories. Sandia National Laboratories is a multi- 



program laboratory managed and operated by Sandia Cor- 
poration, a wholly owned subsidiary of Lockheed Martin 
Corporation, for the U.S. Department of Energy's National 
Nuclear Security Administration under contract DE-AC04- 
94AL85000. 

8. REFERENCES 

[1] E. Acar, D. Dunlavy, T. Kolda, and M. Morup. 
Scalable tensor factorizations for incomplete data. 
Chemometrics and Intelligent Laboratory Systems, 
106:41-56, 2011. 

[2] E. Acar, D. M. Dunlavy, and T. G. Kolda. A scalable 
optimization approach for fitting canonical tensor 
decompositions. J. Chemometrics, 25:67-86, 2011. 

[3] E. Acar and B. Yener. Unsupervised multiway data 
analysis: A literature survey. IEEE Trans. Knowledge 
and Data Engineering, 21(l):6-20, 2009. 

[4] O. Alter, P. O. Brown, and D. Botstein. Generalized 
singular value decomposition for comparative analysis 
of genome-scale expression data sets of two different 
organisms. PNAS, 100(6):3351-3356, 2003. 

[5] L. Badea. Combining gene expression and 
transcription factor regulation data using 
simultaneous nonnegative matrix factorization. In 
Proc. BIOCOMP-2007, pages 127-131, 2007. 

[6] L. Badea. Extracting gene expression profiles common 
to colon and pancreatic adenocarcinoma using 
simultaneous nonnegative matrix factorization. In 
Pacific Symposium on Biocomputing, pages 267-278, 
2008. 

[7] A. Banerjee, S. Basu, and S. Merugu. Multi-way 
clustering on relation graphs. In SDM'07, pages 
145-156, 2007. 
[8] A. M. Buchanan and A. W. Fitzgibbon. Damped 
Newton algorithms for matrix factorization with 
missing data. In CVPR'05, pages 316-322, 2005. 
[9] E. J. Candes and T. Tao. The power of convex 
relaxation: Near-optimal matrix completion. IEEE 
Trans. Information Theory, 56:2053-2080, 2009. 

[10] J. D. Carroll and J. J. Chang. Analysis of individual 
differences in multidimensional scaling via an N-way 
generalization of "Eckart- Young" decomposition. 
Psychometrika, 35:283-319, 1970. 

[11] S. Doclo and M. Moonen. Gsvd-based optimal filtering 
for single and multimicrophone speech enhancement. 
IEEE Transactions on Signal Processing, 
50(9):2230-2244, 2002. 

[12] D. M. Dunlavy, T. G. Kolda, and E. Acar. Poblano 
vl.O: A Mat lab toolbox for gradient-based 
optimization. Technical Report SAND2010-1422, 
Sandia National Laboratories, Mar. 2010. 

[13] R. A. Harshman. Foundations of the PARAFAC 
procedure: Models and conditions for an 
"explanatory" multi- modal factor analysis. UCLA 
Working Papers in Phonetics, 16:1-84, 1970. 

[14] J. Hastad. Tensor rank is NP-complete. J. Algorithms, 
ll(4):644-654, 1990. 

[15] F. L. Hitchcock. The expression of a tensor or a 
polyadic as a sum of products. J. Mathematics and 
Physics, 6(1):164-189, 1927. 

[16] H. Hotelling. Relations between two sets of variates. 
Biometrika, 28:321-377, 1936. 



[17] H. Kargupta, W. Huang, K. Sivakumar, B.-H. Park, 

and S. Wang. Collective principal component analysis 

from distributed, heterogeneous data. In PKDD'OO, 

pages 452-457, 2000. 
[18] J. R. Kettenring. Canonical Analysis of Several Sets of 

Variables. PhD thesis. University of North Carolina at 

Chapel Hih, 1969. 
[19] T. G. Kolda and B. W. Bader. Tensor decompositions 

and applications. SIAM Review, 51(3):455-500, 2009. 
[20] Y. Koren, R. Bell, and C. Volinsky. Matrix 

factorization techniques for recommender systems. 

IEEE Computer, 42(8):30-37, 2009. 
[21] J. Levin. Simultaneous factor analysis of several 

gramian matrices. Psychometrika, 31:413-419, 1966. 
[22] Y.-R. Lin, J. Sun, P. Castro, R. Konuru, 

H. Sundaram, and A. Kelliher. Metafac: community 

discovery via relational hypergraph factorization. In 

KDD'09, pages 527-536, 2009. 
[23] B. Long, X. Wu, Z. M. Zhang, and P. S. Yu. 

Unsupervised learning on k-partite graphs. In 

KDD'06, pages 317-326, 2006. 
[24] B. Long, Z. M. Zhang, X. Wu, and P. S. Yu. Spectral 

clustering for multi- type relational data. In ICML'06, 

pages 585-592, 2006. 
[25] I. V. Mechelen and A. K. Smilde. A generic 

linked-mode decomposition model for data fusion. 

Chemometrics and Intelligent Laboratory Systems, 

104(l):83-94, 2010. 
[26] J. J. More and D. J. Thuente. Line search algorithms 

with guaranteed sufficient decrease. ACM Trans. 

Mathematical Software, 20(3):286-307, 1994. 
[27] J. Nocedal and S. J. Wright. Numerical Optimization. 

Springer, 2006. 
[28] A. P. Singh and G. J. Gordon. Relational learning via 

collective matrix factorization. In KDD'08, 2008. 
[29] A. Smilde, R. Bro, and P. Geladi. Multi- Way Analysis: 

Applications in the Chemical Sciences. Wiley, West 

Sussex, England, 2004. 
[30] A. Smilde, J. A. Westerhuis, and R. Boque. Multiway 

multiblock component and covariates regression 

models. J. Chemometrics, 14:301-331, 2000. 
[31] R. S. Thorpe. Multiple group principal component 

analysis and population differentiation. J. Zoology, 

216(l):37-40, 1988. 
[32] G. Tomasi and R. Bro. A comparison of algorithms for 

fitting the PARAFAC model. Computational Statistics 

and Data Analysis, 50(7):1700-1734, 2006. 
[33] Z. Wang, Y. Wang, and H. Wu. Tags meet ratings: 

Improving collaborative filtering with tag-based 

neighborhood method. In lUVlO: Workshop on Social 

Recommender Systems, 2010. 
[34] J. A. Westerhuis, T. Kourti, and J. F. Macgregor. 

Analysis of multiblock and hierarchical PGA and PLS 

models. J. Chemometrics, 12:301-321, 1998. 
[35] T. Wilderjans, E. Ceulemans, and I. V. Mechelen. 

Simultaneous analysis of coupled data blocks differing 

in size: A comparison of two weighting schemes. 

Computational Statistics and Data Analysis, 

53:1086-1098, 2009. 
[36] V. W. Zheng, B. Cao, Y. Zheng, X. Xie, and Q. Yang. 

Collaborative filtering meets mobile recommendation: 



A user-centered approach. In A A API 0, pages 
236-241, 2010. 
[37] A. Ziehe, P. Laskov, G. Nolte, and K.-R. Miiller. A 
fast algorithm for joint diagonahzation with 
non-orthogonal transformations and its application to 
blind source separation. Journal of Machine Learning 
Research, 5:777-800, 2004. 

APPENDIX 

Here we briefly describe the data generation for Example 1 
in §1. Tensor X and matrix Y coupled in the first mode are 
constructed as follows: 

• Step 1: We form factor matrices Ai G M^^^ and A2 G 
R'^^'^, where R — 2, such that in the first column of Ai, 
entries corresponding to the members of Gi and G2 are 
assigned 1 (plus noise) while entries corresponding to 
the members of G3 and G4 are assigned -1 (plus noise). 
It is the vice versa for the second column; in other 
words, G3 and G4 members have 1 + noise values. The 
columns of A2 are generated similarly, except that in 
this case Gi and G3 form one cluster while G2 and G4 
form another. 

• Step 2: Factor matrices B G R'^''^,C G R^""^ and 
V G M^^^ are generated using random entries follow- 
ing standard normal distribution. 

• Step 3: All columns of factor matrices are normal- 
ized to unit norm. X and Y are constructed as X = 
[[Ai,B,C] and Y = AsV"^. 



